variableDetails.csv worksheet
vignettes/variableDetails.Rmd
The variableDetails worksheet contains details for the variables in variables.csv. Information from the variableDetails.csv worksheet is used by the RecWTable() function of the bllflow package to transform the variables identified in variableDetails$variableFrom into the newly transformed variable in variableDetails$variable.
#> In the `variableDetails.csv` worksheet there are 965 rows and 16 columns
Each row in variableDetails.csv holds the recode rules for transforming a single category of a variable in variables.csv. The exception is the “don’t know”, “refusal”, and “not stated” categories, which are combined into a single missing category. For each unique variable, an else row is used to assign values not identified in other rows. We recommend not combining a variable across CCHS cycles if it has an important change between cycles. variableDetails$notes is used to identify issues that may be relevant when transforming the variable or category.
Additional information on how to create and use variableDetails.csv is in the bllflow package. The bllflow package includes additional helper functions for creating variableDetails.csv using the CCHS Data Documentation Initiative (DDI) files located in ..\cchsflow\inst\extdata\CCHS_DDI.
If a categorical variable has 4 distinct categories, along with a “not applicable” category and the 3 missing categories, there will be 7 rows:
4 rows, one for each distinct category
1 for the not applicable category
1 for the missing categories
1 else row.
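The row count above can be sketched as a small data frame. This is only an illustration: the variable name, the source codes (6 for not applicable, 7 to 9 for the missing categories), and the `label` column are assumptions made for the example, not values taken from variableDetails.csv.

```r
# Illustrative only: a hypothetical categorical variable with 4 categories.
# The source codes 6 and 7:9 and the `label` column are assumptions.
example_rows <- data.frame(
  variable = "example_var",
  recFrom  = c("1", "2", "3", "4", "6", "7:9", "else"),
  recTo    = c("1", "2", "3", "4", "NA::a", "NA::b", "NA::b"),
  label    = c("category 1", "category 2", "category 3", "category 4",
               "not applicable", "missing", "else"),
  stringsAsFactors = FALSE
)

# 4 category rows + 1 not applicable row + 1 missing row + 1 else row
n_rows <- nrow(example_rows)
print(n_rows)
```

The three missing categories share one row because they are all recoded to the same target, NA::b.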
RecWTable() uses the tagged_na() function from the haven package to tag not applicable responses as NA(a), and missing values (don’t know, refusal, not stated) as NA(b). As you will see later, not applicable values are transformed to NA::a, and missing values are transformed to NA::b.
The following columns are listed in variableDetails.csv. Many of them must be specified for RecWTable() to work:
[…] variableDetails.csv, we have designated the variable names used in CCHS cycles from 2007 to 2014 as the final transformed variable name.
[…] N/A. The name of a dummy variable consists of the final variable name, the number of categories in the variable, and the category level for each category. Note that this column is not necessary for RecWTable().
[…] cat; while a transformed variable that is continuous will be specified as cont.
The CCHS database identifier and other information can be extracted using bllflow::ReadDDI():
CCHS2001_DDI <- bllflow::ReadDDI(file.path(getwd(), "../inst/extdata/CCHS_DDI"), "cchs-82M0013-E-2001-c1-1-general-file.xml")
cat('Dataset name: ', unlist(CCHS2001_DDI$ddiObject$codeBook$docDscr$citation$titlStmt$titl))
#> Dataset name:
#> Canadian Community Health Survey, 2001: Cycle 1.1, General File
cat('ID No: ', unlist(CCHS2001_DDI$ddiObject$codeBook$docDscr$citation$titlStmt$IDNo))
#> ID No:
#> cchs-82M0013-E-2001-c1-1-general-file
cat('abstract: ', unlist(CCHS2001_DDI$ddiObject$codeBook$stdyDscr$stdyInfo$abstract))
#> abstract: The Canadian Community Health Survey (CCHS) is a cross-sectional survey that collects information related to health status, health care utilization and health determinants for the Canadian population. The CCHS operates ona two-year collection cycle. The first year of the survey cycle .1 is a large sample, general population health survey, designed to provide reliable estimates at the health region level. The second year of the survey cycle.2 is a smaller survey designed to provide provincial level results on specific focused health topics.
#> <br>
#> This Microdata File contains data collected in the first year of collection for the CCHS (Cycle 1.1). Information was collected between September 2000 and November 2001, for 136 health regions, covering all provinces and territories. The CCHS (Cycle 1.1) collects responses from persons aged 12 or older, living in private occupied dwellings. Excluded from the sampling frame are individuals living on Indian Reserves and on Crown Lands, institutional residents, full-time members of the Canadian Armed Forces, and residents of certain remote regions.
The categorical age variable in the 2001 CCHS survey is DHHAGAGE. If the final variable name for categorical age in the variable column is DHHGAGE, you would write the following in this column: cchs-82M0013-E-2001-c1-1-general-file::DHHAGAGE
The categorical age variable in the CCHS surveys from 2007 to 2014 is DHHGAGE. Since it is the same as the final variable name, you would write in this column [DHHGAGE] once. The variable name that is denoted within the square brackets is the default variable name.
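As a rough sketch of how the two notations above could be read, the helper below looks up the source variable name for a given database identifier and falls back to the bracketed default. The function `resolve_source_variable()` is hypothetical and is not part of bllflow, and combining both notations in one comma-separated string is an assumption made for this example.

```r
# Hypothetical helper, not part of bllflow. Assumes entries are
# comma-separated: "database::variable" pairs plus one "[default]" name.
resolve_source_variable <- function(spec, database) {
  entries <- trimws(strsplit(spec, ",", fixed = TRUE)[[1]])
  default <- NA_character_
  for (entry in entries) {
    if (grepl("::", entry, fixed = TRUE)) {
      pieces <- strsplit(entry, "::", fixed = TRUE)[[1]]
      # A database-specific name takes priority over the default
      if (pieces[1] == database) return(pieces[2])
    } else if (grepl("^\\[.*\\]$", entry)) {
      # Strip the square brackets around the default variable name
      default <- gsub("^\\[|\\]$", "", entry)
    }
  }
  default
}

spec <- "cchs-82M0013-E-2001-c1-1-general-file::DHHAGAGE, [DHHGAGE]"
name_2001  <- resolve_source_variable(spec, "cchs-82M0013-E-2001-c1-1-general-file")
name_other <- resolve_source_variable(spec, "some-other-database")
```

Here `name_2001` picks up the survey-specific name, while any database without its own entry falls back to the bracketed default.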
[…] cat and continuous variables are denoted as cont.
[…] copy so that the function copies the values without any transformations. For the not applicable category, write NA::a. For missing and else categories, write NA::b.
[…] variableDetails.csv, variables that have gone from cat to cont have used midpoints of each category.
[…] numValidCat = N/A. Not applicable, missing, and else categories are not included in the category count. Note that this column is not necessary for RecWTable().
[…] N/A. Note, the function will not work if there are different units between the rows of a variable.
The rules for each category of a new variable are a string in recFrom and a value in recTo. These recode pairs use the same syntax as sjmisc::rec(); for more details see bllflow::RecWTable(). Recode pairs are obtained from the recFrom and recTo columns:
Multiple values that are recoded into a new single value are separated with a comma, e.g. recFrom = "1,2"; recTo = 1
A value range is indicated by a colon, e.g. recFrom = "1:4"; recTo = 1 (recodes all values from 1 to 4 into 1)
For double vectors (with a fractional part), all values within the specified range are recoded; e.g. recFrom = "1:2.5"; recTo = 1 recodes 1 to 2.5 into 1, but 2.55 would not be recoded (since it’s not included in the specified range)
Minimum and maximum values are indicated by min (or lo) and max (or hi), e.g. recFrom = "min:4"; recTo = 1 (recodes all values from the minimum value to 4 into 1)
All other values, which have not been specified yet, are indicated by else, e.g. recFrom = "else"; recTo = NA (recodes all other values, not specified in other rows, to NA)
The else token can be combined with copy, indicating that all remaining, not yet recoded values should stay the same (are copied from the original value), e.g. recFrom = "else"; recTo = "copy"
NA ….. Warsame….
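The recode pair rules above can be illustrated with a small base R interpreter. This is a simplified sketch and not the implementation used by bllflow::RecWTable() or sjmisc::rec(): the helpers `parse_bound()`, `matches_rule()`, and `recode_with_rules()` are hypothetical, rules are applied in row order with the else row last, results are returned as character strings, and leaving NA inputs untouched is an assumption.

```r
# Hypothetical helpers for illustration; not the bllflow implementation.

# Turn one bound token into a number, resolving min/lo and max/hi from x.
parse_bound <- function(token, x) {
  if (token %in% c("min", "lo")) return(min(x, na.rm = TRUE))
  if (token %in% c("max", "hi")) return(max(x, na.rm = TRUE))
  as.numeric(token)
}

# TRUE where x matches a recFrom string such as "1,2", "1:4" or "min:4".
matches_rule <- function(x, rec_from) {
  hit <- rep(FALSE, length(x))
  for (part in trimws(strsplit(rec_from, ",", fixed = TRUE)[[1]])) {
    tokens <- trimws(strsplit(part, ":", fixed = TRUE)[[1]])
    bounds <- vapply(tokens, parse_bound, numeric(1), x = x)
    # A single value is treated as a range with equal lower and upper bound
    hit <- hit | (!is.na(x) & x >= bounds[1] & x <= bounds[length(bounds)])
  }
  hit
}

# Apply the rules in order; the first matching row wins.
recode_with_rules <- function(x, rules) {
  out <- rep(NA_character_, length(x))
  done <- is.na(x)  # assumption: NA inputs are left as NA
  for (i in seq_len(nrow(rules))) {
    hit <- if (rules$recFrom[i] == "else") !done else !done & matches_rule(x, rules$recFrom[i])
    out[hit] <- if (rules$recTo[i] == "copy") as.character(x[hit]) else rules$recTo[i]
    done <- done | hit
  }
  out
}

rules <- data.frame(
  recFrom = c("1:2.5", "5,6", "else"),
  recTo   = c("1", "2", "copy"),
  stringsAsFactors = FALSE
)
recoded <- recode_with_rules(c(1, 2, 2.5, 2.55, 6, 9), rules)
```

In this example 2.55 falls outside "1:2.5", so it reaches the else row and is copied unchanged, as is 9.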
[…] bllflow helper functions. See bllflow documentation.
[…] recode-with-table function. Things to include here would be changes in wording between CCHS surveys, missing or changed categories, and changes in variable type between CCHS surveys.
This example shows how the transformed BMI variable was developed using variableDetails.csv. BMI is a continuous variable that has remained fairly constant in CCHS cycles between 2001 and 2014.
[…] HWTGBMI. This should be written for each row.
[…] cont in each row of BMI.
[…] cont in each row of BMI.
[…] copy written. For the not applicable rows NA::a is written. For the missing and else rows NA::b is written.
[…] N/A is written in each row.
[…] BMI is written. Not applicable rows: not applicable is written. Missing rows: missing. Else row: else.
[…] body mass index is written to give further detail on what BMI is. The other rows remain the same.
[…] kg/m2 is written in each row.
[…] 11.91:57.9. In the 2001 and 2003 CCHS surveys, not applicable was coded as 999.6, so the recFrom for this row would be 999.6:999.6. Similarly, in the 2001 and 2003 CCHS surveys, don’t know was coded as 999.7, refusal was coded as 999.8, and not stated was coded as 999.9. Therefore the recFrom for the missing row for CCHS 2001 and 2003 would be 999.7:999.9. In the not applicable row for the 2005 CCHS survey onwards, the recFrom is 999.96:999.96. In the missing row for CCHS 2005 onwards, the recFrom is 999.97:999.99. For the else row, just write else.
[…] BMI / self-report (D,G) is written as it is written in CCHS documentation. The other rows remain the same, and the values for each missing category are stated in the missing rows.
[…] BMI for each row is sufficient for this variable.
[…] BMI / self-report - (D,G).
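To make the BMI rows concrete, the sketch below applies the recFrom and recTo values described above for CCHS 2005 onwards to a few made-up BMI values. It is a base R illustration only, not how RecWTable() works internally: the function `recode_bmi()` and the input values are hypothetical, and because base R has no tagged missing values, the NA::a and NA::b targets are kept in a separate `na_tag` column rather than created with haven::tagged_na().

```r
# Recode rows for BMI in CCHS 2005 onwards; the else row is handled as the
# default inside recode_bmi(). recode_bmi() is a hypothetical helper.
bmi_rules <- data.frame(
  recFrom = c("11.91:57.9", "999.96:999.96", "999.97:999.99"),
  recTo   = c("copy", "NA::a", "NA::b"),
  stringsAsFactors = FALSE
)

recode_bmi <- function(bmi, rules) {
  value  <- rep(NA_real_, length(bmi))
  na_tag <- rep("NA::b", length(bmi))  # else row: anything unmatched
  for (i in seq_len(nrow(rules))) {
    bounds <- as.numeric(strsplit(rules$recFrom[i], ":", fixed = TRUE)[[1]])
    hit <- !is.na(bmi) & bmi >= bounds[1] & bmi <= bounds[2]
    if (rules$recTo[i] == "copy") {
      value[hit]  <- bmi[hit]        # valid BMI values are copied as-is
      na_tag[hit] <- NA_character_
    } else {
      na_tag[hit] <- rules$recTo[i]  # not applicable or missing
    }
  }
  data.frame(value = value, na_tag = na_tag, stringsAsFactors = FALSE)
}

# Made-up inputs: two valid values, not applicable, refusal, out of range
bmi_raw <- c(22.4, 31.07, 999.96, 999.98, 5)
bmi_out <- recode_bmi(bmi_raw, bmi_rules)
print(bmi_out)
```

The last input, 5, lies below 11.91 and matches no explicit row, so it is picked up by the else default and tagged NA::b.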